ABSTRACT


A study was conducted on the multiple comparison methods 
presented by Scheffé, Tukey, Student-Newman-Keuls, and Dun- 
can under the experimental situation in which all populations 
were normal with equal variances and all means but one were 
equal. The characteristics of all four test procedures were 
compared for the case of multiple comparisons of pairs of 
means. These tests were conducted both with and without the 
prior performance of an analysis of variance. The Tukey and 
Scheffe procedures were compared in tests of linear
combinations of three means. Estimates were made of the power of
the tests and of Type I error rates under both the null and 
alternate hypotheses. Scheffe's method was found to be too 
conservative for pairwise comparisons of means, but it was 
to be preferred over Tukey's method for combinations of more 
than two means. Duncan's method was the most powerful test 
of pairwise comparisons, but it maintained little control 
over one kind of Type I error. The S-N-K procedure showed a
good balance between power and control of Type I errors.
I. INTRODUCTION


Consider the experiment in which random samples are 
drawn from several different populations in order to test 
for equality of the population means. It is often assumed 
that these populations are normal and that the population 
variances are equal but unknown. To be more specific,
suppose there are k groups of independent observations

$$X_{11}, X_{12}, \ldots, X_{1n};\; X_{21}, X_{22}, \ldots, X_{2n};\; \ldots;\; X_{k1}, X_{k2}, \ldots, X_{kn}$$

from normally distributed populations with means
$\mu_1, \mu_2, \ldots, \mu_k$ and common variance $\sigma^2$
(unknown), where $X_{ij}$ represents the jth observation in
the sample from the ith population. Note that it has been
assumed that the population samples are all of size n.


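To test the null hypothesis,
$H_0\colon \mu_1 = \mu_2 = \cdots = \mu_k$,
the model I analysis of variance is usually employed.
The resulting test statistic, formed to test $H_0$, is:

$$F = \frac{\text{mean square between groups}}{\text{mean square within groups}} = \frac{k(n-1)\, n \sum_{i=1}^{k} (\bar{X}_i - \bar{X})^2}{(k-1) \sum_{i=1}^{k} \sum_{j=1}^{n} (X_{ij} - \bar{X}_i)^2},$$

where $\bar{X}_i$ is the mean of the ith sample and $\bar{X}$
is the mean of all kn observations.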


The null hypothesis is rejected if the test statistic 
exceeds the critical value appropriate for the significance 
level of the test. 
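As a rough illustration of the test statistic just described, the following Python sketch computes F for k groups of equal size n. The function name `anova_f_statistic` and the sample data are illustrative assumptions and do not come from the original study.

```python
def anova_f_statistic(groups):
    """One-way (model I) ANOVA F statistic for k groups of equal size n."""
    k = len(groups)
    n = len(groups[0])
    if any(len(g) != n for g in groups):
        raise ValueError("all groups must have the same size n")
    group_means = [sum(g) / n for g in groups]
    grand_mean = sum(group_means) / k
    # Between-groups sum of squares, (k - 1) degrees of freedom.
    ss_between = n * sum((m - grand_mean) ** 2 for m in group_means)
    # Within-groups sum of squares, k(n - 1) degrees of freedom.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (k * (n - 1)))


# Hypothetical data: three groups of three observations each.
example_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [6.0, 7.0, 8.0]]
f_value = anova_f_statistic(example_groups)
print(f_value)
```

The value returned would then be compared with the tabulated F critical value for k-1 and k(n-1) degrees of freedom.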

Suppose the experimenter has rejected the null hypothesis.
In many cases he will want to know which means differ
and which means do not. The multiple comparison procedures
were designed to help answer just this question. The 
research, the results of which are presented in this paper, 
attempted to assess the relative merits of four of the most 
commonly used techniques. The characteristics of the
procedures presented by Scheffe [1953], Tukey [1949], Duncan
[1955], and one credited variously to Student, Newman [1939],
and Keuls [1952] were studied for the case in which all
population means but one were equal. The tests were used
to test $H_0\colon \mu_i = \mu_j$, $i \neq j$. One brief
experiment was conducted to compare the performance of the
Scheffe and Tukey methods when testing a hypothesis
concerning a linear combination of more than two means.


II. MULTIPLE COMPARISON PROCEDURES


Miller [1966] presents a detailed discussion of the 
test procedures and their underlying theoretical bases. 
This work also contains a most complete bibliography of the
field of multiple comparisons. The mechanics of the test 
procedures are presented in the following paragraphs. 

Let

$$s^2 = \frac{\sum_{i=1}^{k} \sum_{j=1}^{n} (X_{ij} - \bar{X}_i)^2}{k(n-1)}.$$

Then $s^2$ is an estimator for $\sigma^2$ with k(n-1) degrees
of freedom. This estimator was used in conjunction with all
of the methods throughout the experiment.


A. SCHEFFE'S METHOD

Scheffe's method is more general than the other methods 
studied in that it does not require equal sample sizes from 
all the groups. It may also be used to test the hypothesis 
that a general linear function or contrast is zero. The 
presentation here is for the special case of equal sample 
sizes from all populations. 

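Consider the linear function or contrast of the population
means,

$$\lambda = \sum_{i=1}^{k} c_i \mu_i, \quad \text{where} \quad \sum_{i=1}^{k} c_i = 0.$$

The contrast is estimated by $L = \sum_{i=1}^{k} c_i \bar{X}_i$,
and the estimate of the variance of L is:

$$\hat{\sigma}_L^2 = \frac{s^2}{n} \sum_{i=1}^{k} c_i^2.$$

Scheffe [1953] derived a confidence interval for all
possible contrasts, $\lambda$, which in turn implies a test of
significance for the null hypothesis $H_0\colon \lambda = 0$.
Reject $H_0$ if

$$|L| > \left\{ (k-1)\, F_\alpha[k-1, k(n-1)]\, \hat{\sigma}_L^2 \right\}^{1/2},$$

where $F_\alpha[k-1, k(n-1)]$ is the tabulated F-value at the
$\alpha$ significance level for $\nu_1 = k-1$ and
$\nu_2 = k(n-1)$.

For the special case where $c_i = -1$, $c_j = 1$, and
$c_m = 0$, $m \neq i$ or $j$, $\lambda = \mu_j - \mu_i$. The test
criterion for $H_0\colon \mu_i = \mu_j$, $i \neq j$;
$i, j = 1, 2, \ldots, k$ then becomes: reject $H_0$ if

$$|L| > \left\{ (k-1)\, F_\alpha[k-1, k(n-1)] \left( \frac{2 s^2}{n} \right) \right\}^{1/2},$$

where the sample means are ranked in ascending order and
$\bar{X}_j > \bar{X}_i$. This may be rewritten as:

$$\bar{X}_j - \bar{X}_i > \left\{ (k-1)\, F_\alpha[k-1, k(n-1)] \left( \frac{2 s^2}{n} \right) \right\}^{1/2}.$$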
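The Scheffe criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `scheffe_critical_value` is hypothetical, the F critical value must be supplied by the caller from tables, and the value 3.68 used below is an approximate 5% point for 2 and 15 degrees of freedom given purely as an example input.

```python
import math


def scheffe_critical_value(coeffs, k, n, s2, f_crit):
    """Threshold that |L| must exceed for a contrast with coefficients coeffs.

    k groups of size n, pooled variance s2, and f_crit the tabulated
    F value for k - 1 and k(n - 1) degrees of freedom (caller supplied).
    """
    if abs(sum(coeffs)) > 1e-12:
        raise ValueError("contrast coefficients must sum to zero")
    var_l = (s2 / n) * sum(c * c for c in coeffs)  # estimated variance of L
    return math.sqrt((k - 1) * f_crit * var_l)


# Pairwise comparison of means 1 and 2 among k = 3 groups of size n = 6,
# with s2 = 1.0 and an illustrative F critical value of 3.68.
pairwise_threshold = scheffe_critical_value((-1, 1, 0), k=3, n=6,
                                            s2=1.0, f_crit=3.68)
print(round(pairwise_threshold, 4))
```

For a simple difference of two means the sum of squared coefficients is 2, which gives the factor 2s²/n in the pairwise form of the criterion.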

B. TUKEY'S METHOD

Tukey's method, in contrast to that of Scheffe, was
designed primarily for tests of simple differences of means,
$\mu_i - \mu_j$, although it too is sufficiently general to be
used for tests of linear combinations of means. It is
exact only for equal sample sizes from all groups; however,
modifications have been proposed to allow its use in the
case of unequal sample sizes [Bancroft, 1968]. As above,
let

$$\lambda = \sum_{i=1}^{k} c_i \mu_i, \quad \text{where} \quad \sum_{i=1}^{k} c_i = 0,$$

with estimate $L = \sum_{i=1}^{k} c_i \bar{X}_i$.

For this test the pivotal test statistic is the studentized
range rather than the F statistic of the previous method.
Tukey [1949] has shown that the test of
$H_0\colon \lambda = 0$ is performed by rejecting $H_0$ if

$$|L| > Q_\alpha[k, k(n-1)] \left( \frac{s^2}{n} \right)^{1/2} \left( \frac{1}{2} \sum_{i=1}^{k} |c_i| \right),$$

where $Q_\alpha[k, k(n-1)]$ is the upper $100\alpha$ percent
point of the studentized range distribution with parameters
k and k(n-1). For the special case of pairwise mean
comparisons, with the sample means placed in ascending order
and $\bar{X}_j > \bar{X}_i$,

$$L = \bar{X}_j - \bar{X}_i \quad \text{and} \quad \frac{1}{2} \sum_{i=1}^{k} |c_i| = 1.$$

The test criterion for this case is then

$$\bar{X}_j - \bar{X}_i > Q_\alpha[k, k(n-1)] \left( \frac{s^2}{n} \right)^{1/2}.$$
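A minimal Python sketch of the Tukey criterion for a general contrast follows. It uses the standard form of Tukey's simultaneous test, in which the studentized range point is scaled by half the sum of the absolute contrast coefficients. The function name `tukey_critical_value` is hypothetical, the studentized range point must be supplied from tables, and 3.67 is an approximate 5% point for 3 means and 15 degrees of freedom used only as an example input.

```python
import math


def tukey_critical_value(coeffs, n, s2, q_crit):
    """Threshold that |L| must exceed under Tukey's method.

    n is the common sample size, s2 the pooled variance, and q_crit the
    tabulated studentized range point for k and k(n - 1) (caller supplied).
    """
    if abs(sum(coeffs)) > 1e-12:
        raise ValueError("contrast coefficients must sum to zero")
    half_abs_sum = 0.5 * sum(abs(c) for c in coeffs)  # equals 1 for a pair
    return q_crit * math.sqrt(s2 / n) * half_abs_sum


# Pairwise comparison among k = 3 groups of size n = 6 with s2 = 1.0.
tukey_pairwise = tukey_critical_value((-1, 1, 0), n=6, s2=1.0, q_crit=3.67)
print(round(tukey_pairwise, 4))
```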
C. STUDENT-NEWMAN-KEULS (S-N-K) PROCEDURE

The last two tests to be presented are both classified 
as multiple range procedures. These procedures are not 
adaptable for tests of general linear combinations of the 


population means. The S-N-K procedure was first proposed
by Newman [1939] and independently by Keuls [1952]. The
basic idea has been attributed to Student (W. S. Gosset)
[Miller, 1966].

The null hypothesis $H_0\colon \mu_1 = \mu_2 = \cdots = \mu_k$
is to be tested against the alternative
$H_1\colon \mu_i \neq \mu_j$ for some $i \neq j$;
$i, j = 1, 2, \ldots, k$. The procedures were designed to
declare which means are significantly different if $H_0$ is
rejected. The S-N-K test procedure is as follows:

1. Arrange the sample means in ascending order of
   magnitude as $\bar{X}_{(1)}, \bar{X}_{(2)}, \ldots, \bar{X}_{(k)}$.

2. Calculate the p-mean critical differences for
   $p = 2, 3, \ldots, k$:

   $$C_p = Q_\alpha[p, k(n-1)] \left( \frac{s^2}{n} \right)^{1/2},$$

   where $Q_\alpha[p, k(n-1)]$ is the upper $100\alpha$ percent
   point of the studentized range distribution with
   parameters p and k(n-1).

3. Declare $\mu_{(1)}$ and $\mu_{(k)}$ significantly different
   if $\bar{X}_{(k)} - \bar{X}_{(1)} > C_k$, where $\mu_{(i)}$
   denotes the population mean corresponding to
   $\bar{X}_{(i)}$. If $\mu_{(1)}$ and $\mu_{(k)}$ do not differ
   significantly, accept $H_0$ and stop the procedure.

4. Declare $\mu_{(k)}$ different from $\mu_{(2)}$ if
   $\bar{X}_{(k)} - \bar{X}_{(1)} > C_k$ and
   $\bar{X}_{(k)} - \bar{X}_{(2)} > C_{k-1}$. If this does not
   hold then state that $\mu_{(k)}$ does not differ
   significantly from $\mu_{(2)}, \mu_{(3)}, \ldots, \mu_{(k-1)}$,
   and hence any pair of means $\mu_{(i)}, \mu_{(j)}$,
   $i, j = 2, 3, \ldots, k$, is not significantly different.
   Similarly declare $\mu_{(k-1)}$ different from $\mu_{(1)}$
   if $\bar{X}_{(k)} - \bar{X}_{(1)} > C_k$ and
   $\bar{X}_{(k-1)} - \bar{X}_{(1)} > C_{k-1}$; otherwise, state
   that $\mu_{(k-1)}$ does not differ significantly from
   $\mu_{(1)}, \mu_{(2)}, \ldots, \mu_{(k-2)}$, and hence any pair
   of means $\mu_{(i)}, \mu_{(j)}$, $i, j = 1, 2, \ldots, k-1$,
   is not significantly different.

5. Proceed, if necessary, until all groups of size
   p, $p = 2, 3, \ldots, k$, have been declared. Note that
   once an ordered subset P of means of size p has
   been declared not significant, all ordered subsets
   Q of P of sizes p-1, p-2, ..., 2 also must be declared
   nonsignificant.
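The stepwise logic above can be sketched as follows. This is an illustrative Python outline rather than the program used in the study: the function name `multiple_range_test` is hypothetical, the critical differences C_p are assumed to be supplied by the caller (computed from tabulated studentized range points), and the numbers in the example are made up.

```python
def multiple_range_test(means, crit):
    """Return the set of index pairs (i, j), i < j, declared different.

    means: sample means in their original (unsorted) order.
    crit:  dict mapping p to the p-mean critical difference C_p, p = 2..k.
    """
    k = len(means)
    order = sorted(range(k), key=lambda i: means[i])  # ranks -> original index
    ranked = [means[i] for i in order]
    homogeneous = []     # rank ranges (lo, hi) declared not significant
    significant = set()
    for p in range(k, 1, -1):              # largest groups first
        for lo in range(k - p + 1):
            hi = lo + p - 1
            # A subset of a nonsignificant group is never tested.
            if any(a <= lo and hi <= b for a, b in homogeneous):
                continue
            if ranked[hi] - ranked[lo] > crit[p]:
                significant.add(tuple(sorted((order[lo], order[hi]))))
            else:
                homogeneous.append((lo, hi))
    return significant


# Hypothetical sample means and critical differences for k = 4.
example_means = [0.1, 0.0, 1.5, 0.3]
example_crit = {2: 0.5, 3: 0.6, 4: 0.7}
declared = multiple_range_test(example_means, example_crit)
print(sorted(declared))
```

Because only the table of critical differences changes, the same routine would serve for any multiple range procedure that follows these steps with a different set of C_p values.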


D. DUNCAN PROCEDURE 

Duncan [1955] proposed a modification of the S-N-K
multiple range procedure with less conservative critical
values. The test procedure is the same as that outlined
above for the S-N-K method except that in step (2) the
p-mean critical differences are defined by

$$C'_p = Q_{\alpha_p}[p, k(n-1)] \left( \frac{s^2}{n} \right)^{1/2},$$

where $Q_{\alpha_p}[p, k(n-1)]$ is the upper $100\alpha_p$
percent point of the studentized range with parameters p
and k(n-1) and $\alpha_p = 1 - (1-\alpha)^{p-1}$. These
special percentage points of the studentized range have been
tabulated by Harter, Clemm, and Guthrie [1959].
Approximations exist for both the S-N-K and Duncan
procedures for the case of unequal group sizes
[Sarhan and Greenberg, 1962].
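To make the Duncan significance levels concrete, the short Python sketch below evaluates 1 - (1 - α)^(p-1) for several group sizes. The function name `duncan_alpha` is a hypothetical label; the formula is the one stated above.

```python
def duncan_alpha(p, alpha):
    """Significance level applied to a group of p means in Duncan's test."""
    return 1.0 - (1.0 - alpha) ** (p - 1)


# Levels for a nominal 5% test and groups of 2 to 5 means.
duncan_levels = {p: duncan_alpha(p, 0.05) for p in range(2, 6)}
for p, level in duncan_levels.items():
    print(p, round(level, 4))
```

The level grows with p, which is why the Duncan critical values are less conservative than the S-N-K values for groups of three or more means.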





III. ERROR BASES


In contrast to the situation in which a statistical 
test is performed on two population means, there is no 
universally accepted measure of the relative effectiveness 
of two statistical tests such as that provided by the 
Neyman-Pearson theory. Most if not all the definitions 
offered in the literature are but intuitive extensions of 
the Neyman-Pearson ideas. There are three possible types 
of errors, and in this paper the following definitions were 
used: 


1. A Type I error was said to have occurred if $\mu_i$
   was declared different from $\mu_j$ when, in fact,
   $\mu_i = \mu_j$.

2. A Type II error was said to have occurred if $\mu_i$
   was declared not different from $\mu_j$ when, in fact,
   $\mu_i \neq \mu_j$.

3. A Type III error was said to have occurred if $\mu_i$
   was declared greater than $\mu_j$ when, in fact,
   $\mu_i < \mu_j$. Type III errors were not tabulated in
   the experiments and the definition is given only
   for the sake of completeness.


A. TYPE I ERRORS
Following the definitions of Bancroft [1968], two
definitions of Type I errors were used in this paper. The
per-comparison error rate was defined as the long-run value
of

$$\frac{\text{Number of comparisons falsely declared significant}}{\text{Total number of comparisons in which no true difference existed}},$$

and the experimentwise error rate was defined as the
long-run value of

$$\frac{\text{Number of experiments in which at least one difference was falsely declared significant}}{\text{Total number of experiments}}.$$

When testing for equality of means among more than two
population means, Type I errors can occur when $H_0$ is true
and also when $H_0$ is false.


B. TYPE II ERRORS

In this study Type II errors were not assessed directly,
but instead the concept of power was used. The power of
a test for a specified configuration of the population
means was defined as the long-run value of

$$\frac{\text{Number of comparisons correctly declared significant}}{\text{Number of comparisons in which a true difference existed}}.$$


IV. THE EXPERIMENT


All of the experiments were simulated on an IBM 360/67 
computer. Standard normal variates were generated using 
the Naval Postgraduate School computer facility library 
Gaussian Random Number Generator (GPN) which is based on 
the general scheme devised by Marsaglia [1964]. The uniform 
random numbers required in the routine were generated using 
an additive-congruential method tested by Green, Smith, and
Klem [1959]. This method started with sixteen random
numbers, $X_1, X_2, \ldots, X_{16}$, and generated the sequence
of random numbers by the additive recurrence
$X_j = X_{j-1} + X_{j-16}$, reduced modulo the generator's
modulus, for $j > 16$. The Gaussian Random Number Generator
has been tested for accuracy by taking means, deviations,
skewness, and kurtosis on 35 samples of 19,000 numbers.
The results showed that the routine generated distributions
with normal characteristics.
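An additive-congruential generator of the kind described can be sketched as below. This is an illustration only: the modulus 2**31, the seed values, and the function name `additive_generator` are assumptions made for the example and are not taken from the source, and the code is not the NPS library routine.

```python
def additive_generator(seeds, modulus=2**31, lag=16):
    """Yield X_j = (X_{j-1} + X_{j-lag}) mod modulus from `lag` seed values.

    The modulus and the seeds are assumptions made for illustration.
    """
    if len(seeds) != lag:
        raise ValueError("exactly `lag` seed values are required")
    state = list(seeds)          # state[0] is X_{j-lag}, state[-1] is X_{j-1}
    while True:
        x = (state[-1] + state[0]) % modulus
        state.pop(0)             # discard the oldest value
        state.append(x)
        yield x


# Hypothetical seeds 1..16; a real generator would use random starting values.
gen = additive_generator(list(range(1, 17)))
first_values = [next(gen) for _ in range(3)]
print(first_values)
```

Dividing each value by the modulus would give a uniform variate on [0, 1), which a separate routine could then transform to a normal variate.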
The critical values used in all of the experiments were 
obtained by linear harmonic interpolation (where necessary) 
of the tabulated values. Percentage points of the
F distribution were obtained from those calculated by
Merrington and Thompson [1943]. Percentage points of the 
studentized range and critical values for Duncan's Multiple 
Range Test were taken from the values derived by Harter, 


Clemm, and Guthrie [1959]. 


Source: NPS Computer Facility, where complete test
results are available on file.


A. UNCONDITIONAL COMPARISONS 

The first set of experiments was designed to make possi- 
ble an evaluation of all four multiple comparison techniques. 
Only contrasts which were simple differences between means 
were considered, as these are the only contrasts that the 
S-N-K and Duncan procedures were designed to test. 

A data set, consisting of a sample of size n (n = 6, 8,
10, 20, 30, or 40) from each of k (k = 3, 4, or 5) standard
normal populations, was generated using the Gaussian Random
Number Generator. The sample means and the estimate of the
common variance were calculated, and the data set was tested
using each of the four methods. For each method a
comparison was recorded as incorrect if the test declared
$\mu_i$ different from $\mu_j$, since all means were, in fact,
equal; otherwise nothing was recorded. If after all
comparisons were made there was at least one incorrect
comparison for a method, an experimentwise error was
recorded for the method.

After completion of the tests with all population means
equal, one of the population means was made greater than the
others. This change was effected by simply adding 0.2 to
the sample mean from population three, $\bar{X}_3$. The
modified set of sample means
$\{\bar{X}_1, \bar{X}_2, \bar{X}_3', \ldots, \bar{X}_k\}$, where
$\bar{X}_3' = \bar{X}_3 + 0.2$, was again subjected to test by
each of the four procedures. For each method a comparison
was recorded as incorrect if the test declared $\mu_i$
different from $\mu_j$ when neither i = 3 nor j = 3. If the
test declared $\mu_i$ different from $\mu_j$ and either i = 3
or j = 3, the comparison was recorded as correct.
Experimentwise errors were recorded as before.

In a Similar manner, the mean of population three was 
increased in steps of size 0.2 to a maximum value of 2.0, 
and the tests were repeated at each step. Upon completion, 
this made up one replication of each of eleven experiments
(one experiment for each value of $\mu_3$). Note that the only
difference between the data tested at the various stages was
the value of $\bar{X}_3$. The other sample means remained
unchanged.

The experiments were repeated for a total of 500
replications for each value of n and k, and then estimates of
power and error rates were calculated. This procedure was
repeated for sample sizes of n = 6, 8, 10, 20, 30, and 40,
for each value of k = 3, 4, and 5, using both 5% and 1%
critical values.
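The bookkeeping for one replication can be sketched in Python as follows. The function name `tally_comparisons`, the zero-based indexing, and the example set of declared pairs are assumptions introduced for illustration; the classification rule itself is the one described in the text.

```python
from itertools import combinations


def tally_comparisons(declared, k, d, shifted=2):
    """Classify the pairs a test declared different in one replication.

    declared: set of pairs (i, j), i < j, declared different (zero-based).
    d:        amount added to the mean of the shifted population.
    shifted:  zero-based index of that population (population three).
    Returns (incorrect, correct, experimentwise_error).
    """
    incorrect = correct = 0
    for pair in combinations(range(k), 2):
        if pair not in declared:
            continue                      # nothing is recorded
        if d != 0 and shifted in pair:
            correct += 1                  # a true difference was detected
        else:
            incorrect += 1                # the two means are in fact equal
    return incorrect, correct, incorrect > 0


# Hypothetical outcome for k = 4 with the third mean raised by 0.2.
result_shifted = tally_comparisons({(0, 1), (0, 2), (1, 2)}, k=4, d=0.2)
result_null = tally_comparisons({(0, 1), (0, 2), (1, 2)}, k=4, d=0.0)
print(result_shifted, result_null)
```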

1. Error Rate and Power Calculations

The parameter d was defined as the true difference
$\mu_3 - \mu_i$, $i = 1, 2, \ldots, k$, $i \neq 3$. Power was then
tabulated as a function of d. The power and error rate
estimates were calculated as follows:

1) for d = 0 ($H_0$ true),

$$\text{Per-comparison error rate} = \frac{\text{No. of "incorrect" comparisons}}{\binom{k}{2} \times 500},$$

and

$$\text{Experimentwise error rate} = \frac{\text{No. of experimentwise errors}}{500};$$

2) for $d \neq 0$ ($H_0$ false),

$$\text{Per-comparison error rate} = \frac{\text{No. of "incorrect" comparisons}}{\binom{k-1}{2} \times 500},$$

$$\text{Experimentwise error rate} = \frac{\text{No. of experimentwise errors}}{500},$$

and

$$\text{Power}(d) = \frac{\text{No. of "correct" comparisons when } \mu_3 - \mu_i = d}{(k-1) \times 500}.$$
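The estimates above reduce to simple ratios of counts. The Python sketch below assumes the denominators implied by the design: all pairs of k means when d = 0, the pairs not involving population three when d is nonzero, and k-1 pairs involving population three for power. The function name `error_rates` and the counts in the example are hypothetical.

```python
from math import comb


def error_rates(k, d, n_incorrect, n_experimentwise, n_correct=0,
                replications=500):
    """Per-comparison rate, experimentwise rate, and power from raw counts."""
    if d == 0:
        null_pairs = comb(k, 2)        # every pair has equal means
        power = None                   # no true differences exist
    else:
        null_pairs = comb(k - 1, 2)    # pairs not involving population three
        power = n_correct / ((k - 1) * replications)
    return {
        "per_comparison": n_incorrect / (null_pairs * replications),
        "experimentwise": n_experimentwise / replications,
        "power": power,
    }


# Hypothetical counts for k = 4 under the null and under d = 1.0.
rates_null = error_rates(k=4, d=0, n_incorrect=60, n_experimentwise=25)
rates_alt = error_rates(k=4, d=1.0, n_incorrect=30, n_experimentwise=20,
                        n_correct=900)
print(rates_null, rates_alt)
```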


B. CONDITIONAL COMPARISONS 

In many actual applications the investigator first 
tests his data for equality of means with the appropriate 
analysis of variance procedure. Only upon rejection of the 
hypothesis that all means are equal does he look to a mul- 
tiple comparison technique for an answer to the question, 
"How do the means differ?" In an attempt to ascertain how 
this procedure changes the error rates and power function 
and whether it is advisable to perform the analysis of 
variance before proceeding, a second set of experiments was 
conducted. In these experiments only those sets of data 
which were declared significant by a model I analysis of 
variance procedure were tested using the four multiple
comparison techniques. In other words, only those samples 
which the investigator adhering to the philosophy described 
above would have tested were subjected to test. The test 
procedure and configurations of the means were the same 
as those described for the unconditional comparisons.
Sufficiently many sets of data were generated to provide for
exactly 500 experiments for each configuration of the means,
$\mu_i = 0$ for $i \neq 3$ and $\mu_3 = d$,
$d = 0, 0.2, 0.4, \ldots, 2.0$, and consequently the
calculations were made in the manner described for the
unconditional comparisons.

For each value of d, the first 500 sets of data failing
the F-test (that is, those for which the analysis of variance
rejected $H_0$) were used for test by the multiple comparison
procedures. For small values of d the generated data sets
were less likely to fail the F-test than those for large
values of d. Consequently the sets of sample means tested
by the multiple comparisons procedures for different values
of d differed not only in the value of $\bar{X}_3$ but also in
the values of the other sample means. Recall that in the
unconditional comparisons only the value of $\bar{X}_3$ changed
for different values of d (for any fixed values of n and k).
This procedure seemed more advisable than using only the
500 sets of data which failed the F-test when d = 0 and
then incrementing $\bar{X}_3$ in steps of size 0.2 in those
sets of means.
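The conditional selection can be sketched as a rejection loop: data sets are generated until the required number have been declared significant by the F-test. The Python outline below is an assumption-laden illustration. The names `f_statistic` and `collect_rejected` are hypothetical, Python's `random.gauss` stands in for the original normal generator, and 3.68 is an approximate 5% F point for 2 and 15 degrees of freedom used only as an example input.

```python
import random


def f_statistic(groups):
    """One-way ANOVA F statistic for k groups of equal size n."""
    k, n = len(groups), len(groups[0])
    means = [sum(g) / n for g in groups]
    grand = sum(means) / k
    between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    within = sum((x - m) ** 2 for g, m in zip(groups, means)
                 for x in g) / (k * (n - 1))
    return between / within


def collect_rejected(k, n, d, f_crit, wanted=500, rng=None):
    """Generate data sets until `wanted` of them exceed f_crit.

    Population three (index 2) has mean d; all others have mean 0.
    Returns the retained data sets and the total number generated.
    """
    rng = rng or random.Random(0)
    kept, generated = [], 0
    while len(kept) < wanted:
        groups = [[rng.gauss(d if i == 2 else 0.0, 1.0) for _ in range(n)]
                  for i in range(k)]
        generated += 1
        if f_statistic(groups) > f_crit:
            kept.append(groups)
    return kept, generated


# Small illustrative run: 20 retained data sets instead of 500.
kept_sets, total_generated = collect_rejected(k=3, n=6, d=1.0, f_crit=3.68,
                                              wanted=20,
                                              rng=random.Random(1))
print(len(kept_sets), total_generated)
```

Because each value of d draws its own retained data sets, the sample means differ across values of d in every coordinate, which is the point made in the paragraph above.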


C. CONTRASTS OTHER THAN SIMPLE DIFFERENCES OF TWO MEANS
Scheffe [1953] has pointed out that although the Tukey
procedure gives shorter confidence intervals than the Scheffe
procedure for contrasts which are simple differences of two
means, $c' = (0, \ldots, 0, 1, 0, \ldots, 0, -1, 0, \ldots, 0)$, the
opposite situation may hold for other linear combinations
of the means. For comparison one set of 500 experiments,
all at the 5% significance level, was performed on samples
of size n from three normal populations to test contrasts of
the form $c' = (2, -1, -1)$, $(-1, 2, -1)$, and $(-1, -1, 2)$.
Population i was $N(\mu_i, 1)$, where $\mu_1 = \mu_2 = 0$ and
$\mu_3 = d$ ($d = 0, 0.2, 0.4, \ldots, 2.0$). The eleven
different configurations of the means thus produced sixteen
different values for the contrasts,
$|\lambda| = 0, 0.2, 0.4, \ldots, 2.0, 2.4, 2.8, \ldots, 4.0$.
After generation of the data sets, the three contrasts
described above were formed in succession and were subjected
to test using the Scheffe and Tukey procedures. The
confidence intervals for the contrasts formed by the two
different methods were tested for inclusion of zero, and
the results were recorded.
After generation of the data sets, the three contrasts 
described above were formed in succession and were subjected 
to test using the Scheffe and Tukey procedures. The con- 
fidence intervals for the contrasts formed by the two 
different methods were tested for inclusion of zero, and 
the results were recorded. 
1. Error Rate and Power Calculations
Type I errors, defined as declaring $\lambda \neq 0$ when
$\lambda = 0$, were possible only when d = 0. For $d \neq 0$ all
of the contrasts were different from zero. The estimates of
Type I error rates for each value of n were calculated as
follows:

$$\text{Per-comparison error rate} = \frac{\text{Number of contrasts declared different from zero when } d = 0}{3 \times 500},$$

$$\text{Experimentwise error rate} = \frac{\text{Number of experiments in which at least one of the contrasts was declared} \neq 0 \text{ when } d = 0}{500}.$$

The estimate of power was calculated as a function
of the absolute value of the true value of the contrast,
$\lambda$, as follows:

$$\text{Power}(|\lambda|) = \frac{\text{Number of contrasts declared different from zero when true value} = \lambda}{\text{Number of tests of contrasts when true value} = \lambda}.$$

The possible non-zero values of $\lambda$ were -0.2, -0.4, ...,
-2.0 and 0.4, 0.8, ..., 4.0, and due to the nature of the
experimental procedure not all values occurred with the
same frequency. The absolute values 0.4, 0.8, 1.2, 1.6, 2.0
occurred 1500 times; 0.2, 0.6, 1.0, 1.4, 1.8 occurred 1000
times; and 2.4, 2.8, 3.2, 3.6, 4.0 occurred 500 times.
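The frequencies of the contrast values can be checked by enumeration. The Python sketch below assumes the three contrasts have coefficient vectors (2, -1, -1), (-1, 2, -1), and (-1, -1, 2); this is an inference from the stated values and frequencies, not something the legible text confirms. Values are held in tenths as integers to avoid floating-point keys.

```python
from collections import Counter

# Assumed coefficient vectors for the three contrasts of three means.
CONTRASTS = [(2, -1, -1), (-1, 2, -1), (-1, -1, 2)]
REPLICATIONS = 500

abs_value_counts = Counter()
for step in range(1, 11):                 # d = 0.2, 0.4, ..., 2.0
    d_tenths = 2 * step                   # d expressed in tenths
    means = (0, 0, d_tenths)              # mu_1 = mu_2 = 0, mu_3 = d
    for c in CONTRASTS:
        value = sum(ci * mi for ci, mi in zip(c, means))
        abs_value_counts[abs(value)] += REPLICATIONS

for tenths in sorted(abs_value_counts):
    print(tenths / 10, abs_value_counts[tenths])
```

Under this assumption two of the contrasts take the value -d and the third takes the value 2d, which yields fifteen distinct non-zero absolute values.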
V. RESULTS


The experimental results are compiled in Appendix A 
and Appendix B. These results were studied in an attempt 
to pinpoint the characteristics of the methods used and any 
significant differences between them. The following para- 
graphs contain detailed discussions of the experimental 
results and, where possible, rankings of the tests based on 
the several criteria mentioned earlier. It should be
remembered that these results pertain to very specific
figurations of the population means. Further, the results 
can only be considered approximately correct for the cases 
examined since they were subject to statistical variation. 
For example, the standard deviation of the experimentwise 
error rate of an approximately five percent test was nearly 
0.01, and for a one-percent test it was about .0045. In
spite of these problems, the results did indicate some
obvious differences among the methods and permitted some
fairly general conclusions.
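The quoted standard deviations are consistent with treating an estimated error rate as a binomial proportion over 500 replications. A minimal Python check, assuming that model and using the hypothetical helper name `rate_standard_error`:

```python
import math


def rate_standard_error(p, replications=500):
    """Standard deviation of an estimated rate with true value p."""
    return math.sqrt(p * (1.0 - p) / replications)


se_five_percent = rate_standard_error(0.05)
se_one_percent = rate_standard_error(0.01)
print(round(se_five_percent, 4), round(se_one_percent, 4))
```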


A. UNCONDITIONAL COMPARISONS 
The results of the unconditional experiments are pre- 
sented by category in the following paragraphs. 
1. Type I Errors Under $H_0$
The estimated per-comparison and experimentwise 
error rates when the null hypothesis was true are displayed 


in Table I. The experimental results indicated that these
error rates were independent of the sample size, and the
values shown in the table were obtained by averaging the 
rates obtained for the six different sample sizes. The 
estimated per-comparison error rates for all but the Duncan 
procedure decreased with increasing values of k. Among 
those for the Duncan procedure the trend was less clear, but
the per-comparison error rates clearly did not increase 
with increasing values of k. This trend is what one should 
expect, considering the general philosophy of multiple 
comparisons. Independent of the significance level of the 
tests and the number of means being tested, the ordering 
of test procedures from low to high based upon per-comparison 
error rates under $H_0$ was: Scheffe < Tukey < S-N-K < Duncan.
The experimentwise error rate for the Duncan tests 
increased rapidly with increasing k, and for k = 5 it had 
reached 0.195 using a five percent test. For the other 
tests no clear relationship with k presented itself. As 
before an ordering, independent of the significance level 
and k, was possible. The ordering from low to high based 


upon experimentwise error rates was: 
Scheffe < Tukey = S-N-K < Duncan 


Note that although the Tukey and S-N-K procedures 
yielded different per-comparison error rates, their
experimentwise error rates were identical. The Duncan procedure
was designed to place primary emphasis on the control of the 


per-comparison error rate, and the experimental results 


point out a relative lack of control of the experimentwise 
error rate. The Scheffé procedure appeared to err
in the opposite direction. One should have expected a five
percent test to yield an experimentwise error rate of about 
five percent. The experimental results showed that the 
experimentwise error rate was significantly lower, indicat- 
ing the procedure is overly conservative for contrasts of 
this type. 

2. Type I Errors Under the Alternate Hypothesis

The per-comparison and experimentwise error rates 
when the null hypothesis was false are given in Tables II 
and III. Table II shows the rates for the Scheffe and Tukey 
procedures, which did not depend upon d. The error rates 
for the two multiple range techniques did indicate a 
dependence on d, and these results are displayed in Table 
III. This difference arises because the credo of the
multiple range tests requires that two means cannot be declared
significant unless every subgroup of the k means which
contains them is declared significant and because different 
critical values are used to test groups of means of differ- 
ent size. The dependence on d was much more pronounced for 
the S-N-K method than for that of Duncan. 

The results showed that the per-comparison error 
rates for all but the Duncan procedure decreased with in- 
creasing k. The dependence in the case of the Duncan method 
was not clear, but it was not increasing. The ranking from 
low to high of the procedures for per-comparison error rates 
was: Scheffe < Tukey < S-N-K < Duncan

The estimated experimentwise error rates when $H_0$ was
false showed no obvious dependence on k for the Scheffe and 
S-N-K procedures, slightly increasing rates for the Tukey 
method, and a pronounced increasing relationship for the 
Duncan test. Within the range of k values considered, all 
tests except the Duncan test maintained their experiment- 
wise error rates near or below the labeled significance 
level of the test; the Duncan procedure demonstrated little 
control over this type of error. The ranking from low to
high based on the experimentwise error rate was:
Scheffe < Tukey < S-N-K < Duncan


3. Power 

Appendix B contains a series of plots of estimated 
power as a function of the true difference d and the sample 
size n. All of the procedures showed increasing power as 
the sample size increased. Power of the Scheffe and Tukey 
procedures decreased as the number of population means being 
tested was increased. The S-N-K procedure showed a similar,
but less pronounced, decrease in power; however the Duncan 
procedure showed little, if any, decrease in power as k 
increased. For all of the cases studied there was a clear 
ordering of the tests based on power in detecting differences 
between pairs of means. The ordering from least powerful to 


most powerful was: 


Scheffe < Tukey < S-N-K < Duncan 
B. CONDITIONAL COMPARISONS

In many respects the results of the conditional compar- 
isons were similar to those of the unconditional experiment. 
The changes noted occurred primarily when the null
hypothesis was true and for small values of d. As d became
large, the results approached those of the unconditional
experiments. This was not surprising since for d large enough one
would expect the two experiments to be providing identical 
sets of means for test by the multiple comparison procedures. 
The following paragraphs point out the characteristic 
changes induced by the conditioning process. Even though 
not specifically pointed out in every case, it should be 
kept in mind that these differences diminished with increas- 
ing values of d. 

1. Type I Errors Under $H_0$

The orderings of the multiple comparison techniques 

based on experimentwise and per-comparison error rates 
obtained for the unconditional experiment were not changed. 
The magnitudes of the error rates increased greatly as a 
result of the conditioning process as was expected. To 
answer questions about the advisability or necessity of
first testing for equality of means, the law of total
probability was used to calculate the difference in overall
error rates resulting from conditioning. The correct
probability statement was:
Pr {Type I error by a MC procedure | $H_0$}

= Pr {Type I error by a MC procedure | Type I
error by ANOV procedure, $H_0$} x Pr {Type I error
by ANOV procedure | $H_0$} + Pr {Type I error by MC
procedure | ANOV procedure correct, $H_0$} x Pr {ANOV
procedure correct | $H_0$}

= Pr {Type I by MC | Type I by ANOV} $\alpha$
+ Pr {Type I by MC | ANOV correct} $(1 - \alpha)$,

where $\alpha$ = Pr {Type I error by ANOV | $H_0$}.


The quantity on the left hand side was estimated for 
each test procedure from the results of the unconditional 
experiments. The results of the conditional experiments 
gave the probabilities of a Type I error by the multiple 
comparison methods given that a Type I error was made in 
the analysis of variance procedure under $H_0$. The nominal
significance level of the analysis of variance used in the
experiment was $\alpha$ = .05. Data were not collected to evaluate
the second term on the right-hand side because of the large
number of tests which would have been required. The two
known quantities were calculated from the experimental
results and compared. These calculations showed that for
both kinds of Type I errors by the Scheffe, Tukey, and
S-N-K methods the known quantities were nearly equal in all
cases. Accordingly, the probability of a Type I error by
these methods when the analysis of variance correctly accepts
$H_0$ must be very small, possibly zero. This result indicated
that for these methods, the prior performance of an analysis 
of variance offered little or no additional protection 


against a Type I error of either kind. It did not mean,
however, that the analysis of variance and these multiple
comparison methods were equivalent since the estimated 
experimentwise error rates of the multiple comparison pro- 
cedures were all less than one. In other words there were 
trials on which the analysis of variance incorrectly rejected 
the null hypothesis, but the multiple comparison procedures 
did not reject. 
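The comparison of the two known quantities amounts to solving the total-probability identity for the term that was not measured. A small Python sketch, with the hypothetical function name `mc_error_given_anova_correct` and made-up input rates:

```python
def mc_error_given_anova_correct(p_unconditional, p_conditional, alpha=0.05):
    """Solve the total-probability identity for the unmeasured term.

    p_unconditional: Pr{Type I error by MC procedure | H0}
    p_conditional:   Pr{Type I error by MC | Type I error by ANOV, H0}
    alpha:           Pr{Type I error by ANOV | H0}
    """
    return (p_unconditional - p_conditional * alpha) / (1.0 - alpha)


# Hypothetical rates: when the unconditional rate equals the conditional
# rate times alpha, the remaining term is zero.
residual_equal = mc_error_given_anova_correct(0.04, 0.8)
# Hypothetical rates in which the unconditional rate is larger.
residual_larger = mc_error_given_anova_correct(0.10, 1.0)
print(residual_equal, residual_larger)
```

A residual near zero corresponds to the case in which a prior analysis of variance adds little protection, while a clearly positive residual corresponds to the case in which it adds some.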

In the case of Duncan's procedure, the experiment- 
wise error rate was one or nearly one for all values of k 
used in the experiment, indicating that whenever the
analysis of variance incorrectly rejected $H_0$, the Duncan test
also rejected. Further, the unconditional probabilities of
Type I errors were significantly larger than the calculated
values of the overall Type I errors resulting from
conditioning. In contrast to the results for the other three
methods, this indicated that the prior performance of an
analysis of variance did offer significantly increased 
protection against Type I errors under the null hypothesis 
and maintained the experimentwise error rate at or near 
the nominal significance level of the test. 

2. Type I Errors Under the Alternate Hypothesis

When the null hypothesis was false both kinds of 
conditional Type I error rates were higher than the
unconditional Type I error rates for all cases studied. In
contrast with the unconditional rates, the rates for the
conditional experiment were not independent of the sample
size. The rates decreased as the sample size increased, the
decrease being more pronounced for smaller values of d.
Contrary to the previous results, both kinds of Type I 
error rate decreased with increasing values of d, and they 
approached the unconditional rates for large values of d. 
The conditional per-comparison error rates decreased with 
increasing values of k for small values of d and showed the 
same mixed tendencies as the unconditional rates for large 
d. The conditional experimentwise error rate for the 
Scheffe procedure decreased with k; all other procedures 
displayed increasing rates as k increased, slowly increasing 
for the Tukey and S-N-K methods and rapidly so for Duncan's
procedure. The results showed in general that when d was 
small, there was a fairly high probability that the differ- 
ences declared significant by the multiple comparison pro- 
cedures would be the wrong ones. 
3. Power 

The conditional power figures, shown in Appendix B, 
were greater than those for the unconditional experiment 
as was expected. The greatest increase was for small values 
of d and the differences between conditional and uncon- 
ditional power decreased with increasing d. The conditional 
power curves were not as smooth as the curves of uncon- 
ditional power. This lack of smoothness was probably 
caused by the method of selection of data sets to be used 
for the multiple comparisons tests. The irregularities 
resulted from the statistical variation between the data 


tested for two different values of d. In the unconditional
experiments this variation was eliminated by using the same
sets of data for all values of d. All other characteristics 
of the power function remained unchanged under conditioning, 


and the ordering of test methods based on power was unchanged. 


C. CONTRASTS OF THREE MEANS

The final experiment, consisting of tests of contrasts 
of three means, showed that for this type of contrast the 
Tukey procedure yielded smaller Type I error rates than 
the Scheffe procedure. The Type I error rates appeared to 
be independent of the sample size and were averaged to 
obtain the estimated error rates. The estimated per- 
comparison error rates of the five percent significance 
level tests were .0153 for the Scheffe method and [illegible] 
for the Tukey method. The estimated experimentwise error 
rates of the five percent tests were .0393 for Scheffe 
and .0207 for Tukey. 

The curves of estimated power plotted in Figures 10 
and 20 showed that, as predicted, the Scheffe test was more 
powerful than that of Tukey for this type of contrast. 
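
The ordering of the two procedures for this kind of contrast can be checked from their critical half-widths. The following is a minimal sketch, not the computation used in this study: it assumes equal sample sizes, a known standard deviation (infinite error degrees of freedom), the standard Scheffe and Tukey formulas for a contrast with coefficients c, and the tabulated five percent studentized range point 3.314 for three means. The function names are illustrative.

```python
import math

K = 3                          # number of means
Q_05 = 3.314                   # tabulated upper 5% studentized range point, 3 means, infinite df
F_05 = -math.log(0.05)         # upper 5% point of F(2, infinity), i.e. chi-square(2) / 2


def scheffe_half_width(c, n, sigma=1.0):
    """Scheffe critical half-width for the contrast sum(c_i * mean_i)."""
    return math.sqrt((K - 1) * F_05) * sigma * math.sqrt(sum(ci * ci for ci in c) / n)


def tukey_half_width(c, n, sigma=1.0):
    """Tukey critical half-width for the same contrast (equal sample sizes)."""
    return Q_05 * sigma / math.sqrt(n) * 0.5 * sum(abs(ci) for ci in c)


if __name__ == "__main__":
    for name, c in (("pairwise", (1.0, -1.0, 0.0)), ("three means", (1.0, -0.5, -0.5))):
        print(name, scheffe_half_width(c, n=10), tukey_half_width(c, n=10))
```

Under these assumptions the Scheffe half-width is the smaller one for the contrast (1, -1/2, -1/2), while the Tukey half-width is the smaller one for the pairwise contrast (1, -1, 0), which agrees with the direction of the power results reported here.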
VI. CONCLUSIONS 


Although the experiments were conducted using only one 
of the many possible ways in which a group of several 
population means could differ, the results obtained should be 
applicable for other cases as well. Scheffe's procedure 
seemed far too conservative when used to test comparisons 
of only two means. This result was not unexpected, since 
the method is completely general whereas the others were 
designed more specifically for contrasts of this type. On 
the other hand, in cases where blends or mixtures could be 
important, the Scheffé procedure would be a better choice 
than that of Tukey. 

The appropriate choice among the remaining three methods 
for use when only contrasts of two means are important 
seemed to depend upon the relative importance the experi- 
menter attaches to the two different kinds of Type I errors. 
It was concluded that the experimenter whose primary concern 
is control of the per-comparison error rate, and who is not 
worried about the experimentwise error rate, should use 
Duncan's method, but only after first testing for equality of 
the means with an analysis of variance procedure. 
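
Why Duncan's method gives up control of the experimentwise rate can be seen from its protection levels. The sketch below rests on the standard definition of Duncan's procedure, which is not restated in this section: a range of p means is tested at level 1 - (1 - alpha)^(p-1), so under the complete null hypothesis the nominal experimentwise rate is the level used for the range of all k means. The function name is illustrative.

```python
def duncan_nominal_experimentwise(alpha, k):
    """Nominal experimentwise Type I error rate of Duncan's procedure
    when all k population means are equal (level of the k-mean range test)."""
    return 1.0 - (1.0 - alpha) ** (k - 1)


if __name__ == "__main__":
    for k in (3, 4, 5):
        print(k, round(duncan_nominal_experimentwise(0.05, k), 4))
```

For the five percent level the nominal rate grows from roughly .10 at k = 3 to roughly .19 at k = 5, which is of the same order as the unconditional Duncan estimates in Table I.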

For the true multiple comparisonist who desires to con- 
trol the experimentwise error rate, the choice lies between 
the S-N-K method and that of Tukey. The two procedures have 
identical experimentwise error rates under the null 
hypothesis, but power, per-comparison error rates, and 
experimentwise error rates under the alternate hypothesis 
all differ. The S-N-K procedure maintained the experimentwise 
error rate under the alternate hypothesis near the 
significance level of the test, whereas the Tukey method 
was more conservative. The overconservatism of the Tukey 
method resulted in lower power for all values of d and n, 
and consequently the S-N-K method appeared to be the better 
choice. 

Figure 39 shows the estimated power of all four methods 
side by side for a typical case to aid visualization of the 
magnitude of the power differential involved. 
APPENDIX A 


TABLE I 


ESTIMATED PER-COMPARISON (PC) AND EXPERIMENTWISE (E) 
ERROR RATES UNDER THE NULL HYPOTHESIS 





Test      Experiment   Level (%)   Type   k=3           k=4           k=5 
Scheffe   uncond.      1           PC     [illegible]   [illegible]   .0006 
Tukey     uncond.      1           PC     [illegible]   [illegible]   [illegible] 
S-N-K     uncond.      1           PC     .0040         [illegible]   [illegible] 
Duncan    uncond.      1           PC     [illegible]   [illegible]   .0066 
Scheffe   uncond.      5           PC     .0148         .0049         .0028 
Tukey     uncond.      5           PC     [illegible]   .0089         [illegible] 
S-N-K     uncond.      5           PC     [illegible]   .0114         [illegible] 
Duncan    uncond.      5           PC     [illegible]   .0368         .0356 
Scheffe   uncond.      1           E      [illegible]   [illegible]   [illegible] 
Tukey     uncond.      1           E      [illegible]   [illegible]   .0140 
S-N-K     uncond.      1           E      .0097         .0100         .0140 
Duncan    uncond.      1           E      .0200         .0250         [illegible] 
Scheffe   uncond.      5           E      [illegible]   .0243         [illegible] 
Tukey     uncond.      5           E      .0480         .0417         .0507 
S-N-K     uncond.      5           E      .0480         .0417         .0507 
Duncan    uncond.      5           E      .0977         .1443         .1959 
Scheffe   cond.        5           PC     [illegible]   [illegible]   .0456 
Tukey     cond.        5           PC     [illegible]   [illegible]   .1114 
S-N-K     cond.        5           PC     .4474         [illegible]   .1490 
Duncan    cond.        5           PC     [illegible]   .3604         [illegible] 
Scheffe   cond.        5           E      .6497         .5863         [illegible] 
Tukey     cond.        5           E      [illegible]   .6857         [illegible] 
S-N-K     cond.        5           E      .8910         [illegible]   [illegible] 
Duncan    cond.        5           E      [illegible]   [illegible]   [illegible] 
TABLE II 


ESTIMATED PER-COMPARISON (PC) AND EXPERIMENTWISE (E) ERROR 
RATES FOR THE TUKEY AND SCHEFFE PROCEDURES UNDER THE 
ALTERNATE HYPOTHESIS IN THE UNCONDITIONAL EXPERIMENTS 


Test      Level (%)   Type   k=3           k=4           k=5 
Scheffe   1           PC     .0023         [illegible]   .0008 
Tukey     1           PC     [illegible]   .0019         [illegible] 
Scheffe   5           PC     .0170         .0042         .0030 
Tukey     5           PC     [illegible]   .0086         .0077 
Scheffe   1           E      .0023         .0026         .0043 
Tukey     1           E      [illegible]   .0050         .0097 
Scheffe   5           E      [illegible]   [illegible]   .0143 
Tukey     5           E      .0206         .0230         .0343 